System and method for detecting personal experience event reports from user generated internet content

ABSTRACT

A method implementable on a computing device for detecting personal experience event reports from user generated content on the Internet is disclosed. The method includes filtering a collection of Internet posts to include only the Internet posts containing personal experience terms. The method additionally includes further filtering the filtered Internet posts by removing the Internet posts with non-personal experience terms.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation application claiming benefit from U.S. patent application Ser. No. 13/253,090 filed Oct. 5, 2011, which is hereby incorporated in its entirety by reference.

FIELD OF THE INVENTION

The present invention relates to Internet search engines generally and to customized search engines for user generated experience reports in particular.

BACKGROUND OF THE INVENTION

The Internet contains a plethora of reports that are at least somewhat related to consumer products and services. The sources for these reports are varied. For example, manufacturer/providers may provide information as part of their marketing efforts. Their competitors may provide conflicting information to promote competing products and services. Nominally disinterested parties provide independent reviews, although such reviews are often prejudiced by concerns not readily apparent to the reader. Such products and services are also often mentioned “by the way” as background for other subjects, making it difficult to weed out “true” reports from a multitude of “hits” received when using conventional Internet search engines.

The Internet also contains “forum” sites where users can post opinions and discuss various issues of interest. Some of the user posts on such sites constitute “personal experience” reports wherein consumers discuss their actual personal experiences using products and services. A typical such personal experience would be something like: “I used product X and my digestion improved immediately.” In such manner, forum sites may provide valuable firsthand information from actual consumers of products and services.

Unfortunately, personal experience event reports are typically posted in free text with only nominal constraints on form or content, rendering them unstructured and difficult to identify by non-manual processes. It is therefore difficult to identify and collate personal experience event reports using conventional Internet search engines, even when such search engines are configured to search forum sites.

SUMMARY OF THE PRESENT INVENTION

There is provided, in accordance with an embodiment of the present invention, a method implementable on a computing device for detecting personal experience event reports from user generated content on the Internet. The method may include filtering a collection of Internet posts to include only the Internet posts containing personal experience terms, and further filtering the filtered Internet posts by removing the Internet posts with non-personal experience terms.

In accordance with an embodiment of the present invention, the method may also include compiling a list of post collection websites, and collecting the Internet posts according to the list of websites for analyzing on a periodic basis.

In accordance with an embodiment of the present invention, compiling may include at least one of detecting “good” textual patterns indicative of an authentic user generated personal experience event report from a training set of authenticated user generated personal experience event reports, or detecting “bad” textual patterns indicative of a non-authentic user generated personal experience event report from a training set of non-valid user generated personal experience event reports.

In accordance with an embodiment of the present invention, the method may additionally include assigning weights to each of the “good” and “bad” textual patterns to reflect a likelihood of the user generated personal experience event reports including each of the “good” and “bad” textual patterns.

In accordance with an embodiment of the present invention, the method may additionally include assigning weights to predictive factors associated with the authentic and non-authentic user generated personal experience event reports in the training sets to reflect a likelihood of the user generated personal experience event reports being associated with some of the predictive factors, where the predictive factors include at least one of external website/page rankings and factors derived from the training sets.

In accordance with an embodiment of the present invention, the derived factors may include at least one of website metadata, number of images per page, number of links per page, ratio of authentic user generated personal experience event reports per discussion thread, number of authentic user generated product personal experience event reports per website, total anchor terms detected, and total terms detected.

In accordance with an embodiment of the present invention, the method may additionally include identifying the candidate websites with Internet posts including terms from at least one of two “anchor” categories, where the anchor categories represent two essential components of user generated product personal experience reports. The method may additionally include collecting at least a sample of the Internet posts from the identified candidate websites. The method may further include scoring each candidate website according to a cumulative weighted score as per the set of weighted indicators, where a pre-defined score threshold indicates a website with user generated personal experience event reports. The method may additionally include adding the website with the user generated personal experience event reports to the list of post collection websites.

There is provided, in accordance with an embodiment of the present invention, a method for compiling a list of Internet post collection websites, implementable on a computing device, the method including detecting “good” textual patterns indicative of an authentic user generated personal experience event report from a training set of authenticated user generated personal experience event reports, and detecting “bad” textual patterns indicative of a non-authentic user generated personal experience event report from a training set of non-valid user generated personal experience event reports.

BRIEF DESCRIPTION OF THE DRAWINGS

The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:

FIG. 1 is a block diagram of a novel user-generated personal experience retrieval system 100, designed and operative in accordance with a preferred embodiment of the present invention;

FIG. 2 is a block diagram of the segment analyzer of the embodiment of FIG. 1;

FIG. 3 is a block diagram of a novel process to be performed by the system of FIG. 1;

FIG. 4 is an illustration of an exemplary Internet post to be analyzed and processed by the system of FIG. 1;

FIGS. 5-7B are illustrations of exemplary scoring tables to be used during the process of FIG. 3;

FIG. 8 is a schematic diagram of a novel forum website selection utility, constructed and operative in accordance with a preferred embodiment of the present invention; and

FIG. 9 is a block diagram of a novel process to be performed by the system of FIG. 8.

It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.

DETAILED DESCRIPTION OF THE PRESENT INVENTION

In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the present invention.

Applicants have realized that currently available Internet search engines are inefficient tools for searching Internet forums for user generated personal experience event reports that may be used to evaluate and compare products and services. An Internet user generated personal experience event report may be a statement written by users on an Internet platform (such as a message board), referring to their own experience with regard to a specific product or service. A specialized search process may be configured to identify such reports related to a specific field of products and/or services in order to filter out “false hits” and extraneous information that may typically be retrieved by a search engine.

Reference is now made to FIG. 1 which illustrates a novel user-generated personal experience retrieval system 100, designed and operative in accordance with a preferred embodiment of the present invention. System 100 may comprise post collector 50 in communication with forums 20 on Internet 10. System 100 may also comprise segment analyzer 200, scoring engine 300 and user search interface 350.

In accordance with a preferred embodiment of the present invention, system 100 may be configured to identify user-generated personal experience event reports that may be related to pharmaceutical products. It will be appreciated that a typical subject for which there may be demand for collating and analyzing user-generated personal experience event reports may be pharmaceuticals. For example, potential users of pharmaceuticals may understandably wish to study personal experience event reports prior to beginning a treatment. To illustrate such an embodiment, system 100 and its methods of operation may therefore be described hereinbelow in the context of a pharmaceutical based configuration. However, it will be appreciated that the present invention may be configured for any suitable subject for which personal experience event reports may be posted on the Internet, for example, automobiles, airline travel, banking services, food and beverages, etc.

Post collector 50 may periodically collect posts from a “collection list” of chat forums 20 on Internet 10. The collected posts may be forwarded to segment analyzer 200 to identify segments of forum posts that may be likely to contain personal experience event reports regarding the subject for which system 100 may be configured. For example, segment analyzer 200 may identify post segments that may be likely to contain personal experience event reports regarding the use of pharmaceuticals.

These segments may be forwarded to scoring engine 300 which may “score” the segments in terms of their likely relevance as personal reports. Scored segments may then be stored in personal experience database 110 along with addressing information, such as a uniform resource locator (URL) for the original post. Users may then use user search interface 350 to search database 110 for user-generated personal experience event reports regarding the products/services for which system 100 may be configured. For example, a user may search for event reports relating to “Drug A” in order to find out whether anyone who had personally used Drug A had reported on its success and/or any side effects suffered when using it. The output of such a search may consist of a list of chat posts, sorted according to the score assigned by scoring engine 300. It will be appreciated that the present invention may include any suitable implementation for user search interface 350, such as, for example, a browser based utility for inputting search parameters and displaying links to related user generated personal experience event reports.

The collection list used by post collector 50 may include chat forums 20 deemed to be relevant to the subject for which system 100 may be configured. For example, if system 100 is configured for personal reports on pharmaceutical products, the collection list may include a list of chat forums 20 on which it may be likely that users may post personal experience event reports relating to their use of pharmaceutical products. It will be appreciated that post collector 50 may be configured to include any suitable method such as known in the art for “scraping” forum posts from the collection list. It will similarly be appreciated that post collector 50 may be configured to perform such “scraping” on an incremental basis to avoid reprocessing older posts.

As will be disclosed hereinbelow, the present invention may also include a novel pre-collection process for compiling the collection list for system 100. However, it will be appreciated that the present invention may include any suitable method for compiling the collection list, including manual inspection.

Reference is now made to FIG. 2 which illustrates segment analyzer 200 in greater detail. Segment analyzer 200 may comprise post filtering module 210, anchor detection module 220, basic segmentation unit 230, density calculator 240 and segment optimizer 250. Segment analyzer 200 may also comprise filter database 215, anchor database 225 and terms database 235, each of which may be referenced by the other elements of segment analyzer 200.

Reference is now also made to FIG. 3 which illustrates a novel post segmentation process 260 that may be executed by segment analyzer 200 to derive optimally segmented user-generated personal experience event reports from the posts collected by post collector 50.

Post filtering module 210 may receive (step 262) posts from post collector 50. Post filtering module 210 may filter (step 264) these posts according to terms found in filter database 215. Filter database 215 may store a list of categorized relevant terms which module 210 may search for in each post. Depending on the configuration of system 100, at least one term from a combination of some of the categories must be found in a post for that post to pass through step 264. The categories may include, for example, product/service name, indication of personal reference, and indication of personal experience. The product/service name category may consist of names of products/services regarding which a user of system 100 may wish to search for personal experience event reports. It will be appreciated that other configurations for system 100 are included in the present invention. For example, if system 100 is configured for automobile research, the terms in the product/service name category may include a list of automobile makes, manufacturers and nicknames, such as, for example: “Corvette”, “Chevrolet”, “Chevy”, and “Vette”. The category for indications of personal reference may include terms such as “I”, “my”, “me”, “mine”, “myself”, etc. that may indicate that the post refers to an actual personal experience. The category for personal experience may include terms such as, for example, “I used”, “I bought”, “I had”, etc. that may indicate that the poster had an actual personal experience; that the report was not based on hearsay or opinion. In accordance with a preferred embodiment of the present invention, a post may have to contain at least one term from each of these categories in order to pass through step 264.

It will be appreciated, however, that depending on the configuration of system 100 there may be other term categories in filter database 215. For example, if system 100 is configured for pharmaceuticals, the relevant terms may be divided into five categories: Drug name (i.e. product/service name), indication of personal reference, indication of personal drug experience, symptom, and personal symptom experience. Symptom terms may be precise medical terms, such as, for example, “headache”, or alternatively they may also include user descriptions such as “my head exploded”. Personal symptom experience terms may be indicative of the poster having a personal cause/reason for using the indicated drug, for example: “I suffered from”, “I have experienced”. In accordance with a preferred embodiment of the present invention, when system 100 is configured for pharmaceuticals, terms from all five categories must be present in a post in order for it to pass through step 264. In accordance with an alternative preferred embodiment, post filtering module 210 may be configured to require terms from only four categories, wherein a term from only one of the personal experience and personal symptom experience categories may be required. It will be appreciated that similar categories may be used to configure system 100 for non-pharmaceutical products and/or services. For example, if system 100 is configured for automobile research, the symptom category may be replaced by a “preference category” including terms such as “family car”, “sports car”, “road handling” or “seven seats”. Similarly, the personal symptom experience category may be replaced by a personal preference category including terms such as “I need a bigger car”, “I wanted a sports car” or “I value engine performance”.
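By way of illustration only, the category-based filtering of step 264 might be sketched as follows. The term lists, the category keys and the names `FILTER_TERMS`, `ALL_FIVE`, `FOUR_OF_FIVE` and `passes_filter` are hypothetical stand-ins for the contents of filter database 215 and are not taken from the specification; whole-word, case-insensitive matching is likewise an assumption.

```python
import re
from typing import Dict, List, Sequence

# Hypothetical stand-in for filter database 215: category -> terms.
FILTER_TERMS: Dict[str, List[str]] = {
    "drug_name": ["drug a", "drug b"],
    "personal_reference": ["i", "my", "me", "mine", "myself"],
    "personal_experience": ["i used", "i took", "i had"],
    "symptom": ["headache", "my head exploded"],
    "personal_symptom_experience": ["i suffered from", "i have experienced"],
}

# Each inner list is a group; at least one category in every group must match.
ALL_FIVE = [[c] for c in FILTER_TERMS]
FOUR_OF_FIVE = [
    ["drug_name"],
    ["personal_reference"],
    ["symptom"],
    ["personal_experience", "personal_symptom_experience"],
]


def contains_term(text: str, term: str) -> bool:
    """Whole-word, case-insensitive match of a possibly multi-word term."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def category_present(post: str, category: str) -> bool:
    """True if the post contains at least one term of the category."""
    return any(contains_term(post, term) for term in FILTER_TERMS[category])


def passes_filter(post: str,
                  required_groups: Sequence[Sequence[str]] = ALL_FIVE) -> bool:
    """True if every group has at least one category present in the post."""
    return all(
        any(category_present(post, c) for c in group)
        for group in required_groups
    )
```

Passing `FOUR_OF_FIVE` instead of the default corresponds to the alternative embodiment in which a term from only one of the two personal experience categories is required.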

Anchor detection module 220 may detect (step 266) segment anchors in posts that contain all of the required term categories. Module 220 may reference database 225 for lists of segment anchor terms to match to terms in the posts. Segment anchors may represent a pair of term categories that may together define the personal experience event reports of interest for system 100. For example, in a pharmaceutical configuration, the segment anchors may be the drug name and symptom categories. Alternatively, the segment anchors may be the drug name and personal symptom experience categories. In accordance with a preferred embodiment of the present invention, segment anchors for a pharmaceutical configuration may be terms from the drug name and symptom categories. Database 225 may be populated by a publicly available database of drugs and symptoms.

Basic segmentation unit 230 may then segment (step 268) the posts based on the anchors identified in step 266 to find the minimal text segments in the post that have at least one term from each of the categories required for the filter process in step 264. Unit 230 may first search for the required terms between the identified anchors and may then incrementally search before and after the anchors one word at a time until at least one of the terms from all of the relevant categories may be identified in order to define basic segments.
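The expansion performed in step 268 might be sketched as follows, under simplifying assumptions that are not part of the specification: terms are single tokens, the hypothetical mapping `CATEGORIES` stands in for the categorized terms, and the segment grows by one word on each side per iteration, so the result is not guaranteed to be strictly minimal.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical single-token terms mapped to their categories.
CATEGORIES: Dict[str, str] = {
    "i": "personal_reference",
    "my": "personal_reference",
    "took": "personal_experience",
    "aspirin": "drug_name",
    "headache": "symptom",
}
REQUIRED: Set[str] = {
    "personal_reference", "personal_experience", "drug_name", "symptom",
}


def basic_segment(tokens: List[str], categories: Dict[str, str],
                  first_anchor: int, last_anchor: int,
                  required: Set[str]) -> Optional[Tuple[int, int]]:
    """Return (start, end) token indices, end exclusive, or None."""
    start, end = first_anchor, last_anchor + 1  # text between the anchors

    def covered(s: int, e: int) -> Set[str]:
        return {categories[t] for t in tokens[s:e] if t in categories}

    while not required <= covered(start, end):
        if start == 0 and end == len(tokens):
            return None  # whole post searched, a category is still missing
        if start > 0:
            start -= 1   # one word before the segment
        if end < len(tokens):
            end += 1     # one word after the segment
    return start, end
```

For the tokens of “honestly i took aspirin for my headache today”, with anchors at “aspirin” and “headache”, the sketch first examines the span between the anchors and then widens it by one word on each side to pick up the personal experience term “took”.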

Density calculator 240 may reference terms database 235 to calculate (step 270) the density of relevant terms in each basic segment. The density may be defined as the ratio of the relevant terms each multiplied by an associated weight stored in database 235, divided by the overall number of words in the basic segment. It will be appreciated that each term in database 235 may have a different defined weight that may reflect its value as a predictor of the likelihood that the post being analyzed may represent a user generated personal experience event report. Accordingly, the calculated density score may provide a measure of the amount of relevant information contained in the specified segment. It will be appreciated that any suitable method may be used to assign the weights. As will be described hereinbelow, in accordance with a preferred embodiment of the present invention, linear regressions may be run on a training set of data to derive these weights.
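A minimal sketch of the density calculation of step 270 follows, assuming the density is the sum of the weights of all term occurrences divided by the word count of the segment. The weights in `TERM_WEIGHTS` are invented for illustration and merely stand in for terms database 235.

```python
import re
from typing import Dict

# Hypothetical stand-in for terms database 235: term -> weight.
TERM_WEIGHTS: Dict[str, float] = {
    "i took": 2.0,
    "drug a": 1.5,
    "headache": 1.5,
    "heard of": -2.0,  # a negative term lowers the density
}


def density(segment: str, weights: Dict[str, float] = TERM_WEIGHTS) -> float:
    """Weighted term occurrences divided by the number of words."""
    words = segment.split()
    if not words:
        return 0.0
    text = segment.lower()
    total = 0.0
    for term, weight in weights.items():
        # Count whole-word occurrences of the term in the segment.
        hits = re.findall(r"\b" + re.escape(term) + r"\b", text)
        total += weight * len(hits)
    return total / len(words)
```

Under these assumed weights, the seven-word segment “I took Drug A for my headache” contains three weighted terms and has a density of 5.0/7.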

It will also be appreciated that some of the terms may have negative values. In addition to the terms in filter database 215, terms database 235 may also store other categories of terms that may also be used to assess the likelihood of a segment containing a valid user-generated personal experience event report. For example, terms database 235 may also store terms relating to a “negative” category. Terms such as “heard of”, “likely”, “I've been told”, “did not” may typically impact negatively on the likelihood that a given report is a true personal experience, and may therefore be significant when assessing a given segment at the next step of the process. Depending on the configuration of system 100, other categories may be added as well. For example, in an exemplary configuration for pharmaceuticals, there may be an “outcome” or “result” category that may include terms such as “got better”, “recovered” or “condition worsened”. As in the embodiments described hereinabove, each term in such a category may be weighted to reflect its value as a predictor of the likelihood that the post being analyzed may represent a user generated personal experience event report.

Segment optimizer 250 may incrementally check each word before and after the segment to find (step 272) the next term from database 235. Density calculator 240 may then recalculate (step 274) the density as in step 270. If the result is that density has increased (step 276), segment optimizer 250 may again find (step 272) the next term. Steps 272 and 274 may be repeated until the density ceases to increase (step 276) at which point the final, presumably optimized, segment may be output by segment analyzer 200.
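The loop of steps 272 to 276 might be sketched as a greedy expansion, as below. This is an illustrative reading rather than a definitive implementation: it assumes single-token terms, a hypothetical `weights` mapping in place of database 235, and that, when a term is available on both sides, the expansion giving the higher density is tried.

```python
from typing import Dict, List, Optional, Tuple


def span_density(tokens: List[str], start: int, end: int,
                 weights: Dict[str, float]) -> float:
    """Sum of term weights in tokens[start:end] divided by its length."""
    span = tokens[start:end]
    if not span:
        return 0.0
    return sum(weights.get(t, 0.0) for t in span) / len(span)


def next_term_before(tokens: List[str], start: int,
                     weights: Dict[str, float]) -> Optional[int]:
    """Index of the nearest term to the left of the segment, if any."""
    for i in range(start - 1, -1, -1):
        if tokens[i] in weights:
            return i
    return None


def next_term_after(tokens: List[str], end: int,
                    weights: Dict[str, float]) -> Optional[int]:
    """New exclusive end that takes in the nearest term to the right."""
    for i in range(end, len(tokens)):
        if tokens[i] in weights:
            return i + 1
    return None


def optimize_segment(tokens: List[str], start: int, end: int,
                     weights: Dict[str, float]) -> Tuple[int, int]:
    """Expand the segment term by term while the density keeps increasing."""
    best = span_density(tokens, start, end, weights)
    while True:
        candidates = []
        left = next_term_before(tokens, start, weights)
        if left is not None:
            candidates.append((left, end))
        right = next_term_after(tokens, end, weights)
        if right is not None:
            candidates.append((start, right))
        if not candidates:
            return start, end
        d, s, e = max((span_density(tokens, s, e, weights), s, e)
                      for s, e in candidates)
        if d <= best:  # density ceased to increase (step 276)
            return start, end
        best, start, end = d, s, e
```

Because a negatively weighted term lowers the recalculated density, an expansion that would take in such a term tends to be rejected, which ends the loop.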

Reference is now made to FIG. 4 which illustrates an exemplary post as analyzed by segment analyzer 200. Terms 282 and 284 may represent anchor terms, “symptom” and “drug name” respectively. Term 281 may represent a personal experience term, terms 288 may represent personal reference terms, and terms 289 may represent negative terms. It will be appreciated that there may be two sets of anchor terms 282 and 284. Segment analyzer 200 may use density calculator 240 to compare the density of the two sets in order to define a basic segment 285. Segment analyzer 200 may use terms 282A and 284A to define basic segment 285 since they reflect a denser segment; they “enclose” personal experience term 281, whereas terms 282B and 284B are much farther away from term 281. As described hereinabove, segment analyzer 200 may optimize basic segment 285 by expanding it to include additional terms and recalculating density (steps 272 and 274). Accordingly, an exemplary optimal segment 290 may be defined by expanding basic segment 285 to include terms 287 and 288A as well. It will also be appreciated that the second and third sentences may contain several negative terms 289, which may decrease the likelihood that an optimal segment may be found in those sentences.

Reference is now made to FIG. 5 which illustrates an exemplary factor weight table 305, suitable for use with a pharmaceutical configuration of system 100. Scoring engine 300 may use such a table to “score” the optimized segments received from segment analyzer 200 in order to assess the likelihood that they may contain relevant user-generated personal experience event reports. Each factor 310 may represent a possible situation that may occur in a segment, and may be weighted to reflect the effect of such a situation on the likelihood that a post may indeed be a relevant user-generated personal experience event report. It will be appreciated that any suitable method may be used to assign the weights. As will be described hereinbelow, in accordance with a preferred embodiment of the present invention, linear regressions may be run on a training set of data to derive these weights.

For example, high concept density, i.e. high density as calculated by density calculator 240, may likely indicate that a post may indeed be a relevant user-generated personal experience event report. On the other hand, the appearance of a second drug between the anchors may lessen this likelihood, and accordingly may be given a negative weight, for example: −5. The proximity of terms may also reflect on the likelihood that a post may indeed be a relevant user-generated personal experience event report. For example, the farther apart a drug or experience and an associated side effect term may be mentioned in the segment, the less likely that they represent a “true” personal experience event report for that drug. Accordingly, proximity factors may be assigned negative weights. It will be appreciated that the exemplary values in table 305 may be derived from statistical modeling of actual pharmaceutical related forum posts. However, the present invention may also include other feature-weight sets for both pharmaceutical and other configurations.

FIG. 6, to which reference is now made, illustrates table 305 (now labeled 305′) with exemplary values added based on an exemplary post segment. In order to score the post, scoring engine 300 may multiply each factor value by its associated weight, and then add the products for the final score. The score for these exemplary values would thus be computed as:

Score=23*(−2)+1*(−3)+0*(−5)+0*(−5)+9*1+0.34*2+0*4+1*(−10)+1*10+0*(−10)=−39.28
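The weighted sum above can be restated as a short sketch. The values and weights are copied in order from the expression; the names `FACTOR_VALUES`, `FACTOR_WEIGHTS` and `segment_score` are hypothetical, and the factor labels of table 305′ are not reproduced because they appear only in the figure.

```python
from typing import Sequence

# Factor values and weights in the order they appear in the expression above.
FACTOR_VALUES = [23, 1, 0, 0, 9, 0.34, 0, 1, 1, 0]
FACTOR_WEIGHTS = [-2, -3, -5, -5, 1, 2, 4, -10, 10, -10]


def segment_score(values: Sequence[float], weights: Sequence[float]) -> float:
    """Multiply each factor value by its weight and add the products."""
    if len(values) != len(weights):
        raise ValueError("each factor value needs exactly one weight")
    return sum(v * w for v, w in zip(values, weights))


score = segment_score(FACTOR_VALUES, FACTOR_WEIGHTS)
```

Evaluated with the values exactly as listed, this sketch gives approximately −39.32, which differs slightly from the total stated above; the underlying figure is not reproduced here, so the source of the difference cannot be determined from the text. Either way the score is strongly negative.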

A negative score may indicate that the likelihood of a relevant report may be low. System 100 may be configured to store all posts with a score above a certain threshold in personal experience database 110.

FIGS. 7A and 7B, to which reference is now made, show the scoring for two exemplary post segments referring to “Drug B”. FIG. 7A shows a score of +14.83, whereas FIG. 7B shows a score of −14.46. The salient differences between the two examples may be that the example in FIG. 7A has an explicit “symptom experience” (i.e. “no sex drive”) and lacks a negating factor; whereas the example in FIG. 7B has a negating factor (“heard”) and lacks an explicit symptom experience (“can cause” which may indicate a lack of actual experience). Accordingly, the post from FIG. 7A may be determined to qualify as a user generated personal experience event report, whereas the post from FIG. 7B may not. It will be appreciated that the threshold for qualification may be configurable.

It will be appreciated that it may not be possible to continuously perform comprehensive searches for user generated personal experience event reports from among all of the content available on the Internet. By necessity, the “collection list” referred to hereinabove may therefore represent only a small fraction of the websites on the Internet. In accordance with a preferred embodiment of the present invention, a forum website selection utility may be used to identify appropriate websites for collection by post collector 50, thus reducing the “universe” of websites for post collection to a manageable number of relevant websites with authentic, non-commercial and non-SPAM user generated personal experience event reports. Reference is now made to FIG. 8 which illustrates forum website selection utility 400, constructed and operative in accordance with a preferred embodiment of the present invention.

Utility 400 may comprise pre-collection post collector 450, pattern recognizer 430, training set scoring engine 440 and candidate scoring engine 460. Utility 400 may communicate with Internet 10 via post collector 450, which may be configured with functionality for collecting posts from Internet websites similar to that of post collector 50. As will be described hereinbelow, pre-collection post collector 450 may collect Internet posts from training and candidate websites as part of a process to generate website collection list 465, whereas post collector 50 may collect posts from the websites in collection list 465.

Reference is also made to FIG. 9 which illustrates a novel website selection process 500 to be performed by utility 400 in accordance with a preferred embodiment of the present invention. Pre-collection post collector 450 may collect (step 510) posts from a training set of websites that may include “good” websites 405 which may be known to have user generated personal experience event reports. In accordance with an alternative preferred embodiment of the present invention, the training set may also include “bad” websites 410, which may be known to have content related to the search subject (e.g. pharmaceuticals, cars, etc., depending on the configuration of system 100) which may not qualify as user generated personal experience event reports.

“Good” websites 405 may be defined by any suitable method. For example, a generic search engine may be used to locate websites according to relevant keywords, and at least a subset of each website's content may be manually examined to determine whether or not the website includes user generated personal experience event reports. In accordance with a preferred embodiment of the present invention, the posts collected by pre-collection post collector 450 may be filtered to contain only verified authentic user generated personal experience event reports. The relevant keywords may be provided by an outside source such as known relevant terms database 425. For example, if system 100 is configured for pharmaceuticals, database 425 may be a publicly available database of medical terms that may include comprehensive lists of drugs and known symptoms. Similar methods may also be used to define “bad” websites.

Pattern recognizer 430 may detect (step 520) recurring patterns in the training set posts. It will be appreciated that any known, suitable methods for pattern detection/recognition may be used in the context of step 520. For example, such detection may include starting by searching for instances of terms from known relevant terms database 425. In accordance with a preferred embodiment of the present invention, database 425 may contain examples of at least one (and preferably both) of the anchor categories for which system 100 may be configured. For example, database 425 may contain a list of drugs and known symptoms. It will be appreciated that database 425 may provide the basis for anchor database 225.

Step 520 may also include detection of recurring terms that may not be found in database 425. For example, indications of personal reference/experience terms such as those in filter database 215 may also be detected. Exemplary such terms may include phrases such as: “I took” or “I felt better”. In accordance with a preferred embodiment of the present invention, filter database 215 may be at least in part populated based on some or all of the terms detected in step 520.

It will be appreciated that some of the recurring terms detected in step 520 may be “negative” in nature. For example, terms such as “buy”, “sale”, “selling” may indicate an attempt to sell or market a product and that the post may therefore not be an authentic user generated personal experience event report. Such terms may typically be found in posts on bad websites 410.

It will be appreciated that step 520 may include detection of larger expressions as well. For example, a “moving window” may be used to check for recurring combination expressions including one or more of the anchor terms from database 425. For example, in the text: “this morning I took Drug A and less than an hour later my headache was gone,” pattern recognizer 430 may initially detect anchors “Drug A” (drug name) and “headache” (symptom). By incrementally employing a moving window to detect combination expressions around these anchors, pattern recognizer 430 may also detect larger expressions such as personal experience term “I took” in juxtaposition to anchor term “Drug A”, and a variant on the initial symptom term, “headache was gone”. Pattern recognizer 430 may be configured to perform statistical analysis on the terms detected in step 520 to track their occurrences and determine their significance.
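One possible reading of the moving window is sketched below. It assumes single-token anchors, windows of two to `max_width` tokens, and that a window is of interest when it contains or directly adjoins an anchor; none of these choices is dictated by the specification, and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Set


def expressions_near_anchors(tokens: List[str], anchors: Set[str],
                             max_width: int = 3) -> Set[str]:
    """Collect windows of 2..max_width tokens touching an anchor."""
    positions = [i for i, t in enumerate(tokens) if t in anchors]
    found: Set[str] = set()
    for width in range(2, max_width + 1):
        for start in range(len(tokens) - width + 1):
            # The window tokens[start:start+width] touches an anchor if
            # the anchor lies inside it or immediately next to it.
            if any(start - 1 <= p <= start + width for p in positions):
                found.add(" ".join(tokens[start:start + width]))
    return found


def recurring_expressions(posts: Iterable[str], anchors: Set[str],
                          min_posts: int = 2,
                          max_width: int = 3) -> Dict[str, int]:
    """Expressions near anchors that occur in at least min_posts posts."""
    counts: Counter = Counter()
    for post in posts:
        tokens = re.findall(r"[a-z0-9']+", post.lower())
        counts.update(expressions_near_anchors(tokens, anchors, max_width))
    return {expr: n for expr, n in counts.items() if n >= min_posts}
```

Run over a set of training posts with anchors such as “aspirin” and “headache”, a sketch of this kind would surface candidates such as “i took” and “headache was gone” together with the number of posts in which each occurs, which could then feed the statistical analysis described above.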

It will be appreciated that utility 400 may be configured to facilitate inspection of the results of step 520 by a user of system 100, and to enable the user to adjust the input data as necessary to achieve a truer result. Accordingly, step 520 may be repeated as necessary. The patterns detected by pattern recognizer 430 may be stored in detected patterns database 415.

Training set scoring engine 440 may score (step 530) the terms in detected patterns database 415 to produce weighted indicators of the likelihood that a given website may or may not contain user generated personal experience event reports. Such scoring may employ any suitable method. For example, engine 440 may run a linear regression on the terms in detected patterns database 415 vis-à-vis the training set of posts from “good” and “bad” websites to determine the weight of each term as an indicator of likelihood that a given website is either “good” or “bad”.
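
By way of a non-limiting illustration, such a regression may be sketched in Python as follows. Each training post is encoded as a vector of term-presence flags, labeled 1 for a “good” source and 0 for a “bad” source, and a least-squares fit assigns one weight per term. The term list, the six sample posts, the function names, and the use of batch gradient descent as the fitting procedure are all assumptions made for this example; the description above specifies only that a linear regression may be run.

```python
# Hypothetical terms standing in for detected patterns database 415.
TERMS = ["i took", "felt better", "buy", "sale"]

# Hypothetical training posts: (text, label) with 1 = "good" and 0 = "bad".
TRAINING = [
    ("i took drug a and felt better", 1),
    ("i took it last week", 1),
    ("felt better after one dose", 1),
    ("buy drug a now on sale", 0),
    ("buy cheap pills", 0),
    ("big sale today", 0),
]

def featurize(text):
    """Encode a post as 0/1 flags, one per term in TERMS."""
    return [1.0 if term in text else 0.0 for term in TERMS]

def fit_linear_regression(rows, labels, epochs=2000, lr=0.5):
    """Least-squares fit by batch gradient descent; returns (weights, bias)."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    m = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = bias + sum(w * xi for w, xi in zip(weights, x)) - y
            grad_b += err
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
        bias -= lr * grad_b / m
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
    return weights, bias

rows = [featurize(text) for text, _ in TRAINING]
labels = [float(label) for _, label in TRAINING]
fitted, bias = fit_linear_regression(rows, labels)
term_weights = dict(zip(TERMS, fitted))
```

In this toy data set the personal experience terms receive positive weights and the sales terms receive negative weights, which is the kind of signed, weighted indicator that may then be stored for later scoring.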

In accordance with a preferred embodiment of the present invention, engine 440 may expand the scoring process to also include other indicators from ranking sources database 470. Database 470 may represent rankings from external sources such as, for example, Google page ranks and/or Alexa ratings. Engine 440 may include the associated rankings for the page on which each post may be located as additional factors when running the linear regression on the terms in detected patterns database 415.

In accordance with a preferred embodiment of the present invention, engine 440 may expand the scoring process to also include additional factors that may be calculated or derived from the original posts. Such additional factors may include, for example, the query rank of the original query that identified the post as a candidate and meta keywords of the page.

In accordance with a preferred embodiment of the present invention, engine 440 may expand the scoring process to also include the number of images and/or links on the page. It will be appreciated that most user forums have relatively few images and links per page. Accordingly, a higher number of links or images per page may tend to indicate a “bad” website.
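
By way of a non-limiting illustration, the per-page image and link counts may be derived with the Python standard library `html.parser` module as sketched below. The class name `LinkImageCounter`, the helper `count_links_and_images`, and the sample page are assumptions made for this example; counting only `<a>` elements that carry an `href` attribute is likewise a simplifying choice.

```python
from html.parser import HTMLParser

class LinkImageCounter(HTMLParser):
    """Counts hyperlinks (<a href=...>) and images (<img>) in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = 0
        self.images = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.links += 1
        elif tag == "img":
            self.images += 1

def count_links_and_images(html):
    """Return (link_count, image_count) for the given HTML text."""
    parser = LinkImageCounter()
    parser.feed(html)
    parser.close()
    return parser.links, parser.images

# Hypothetical page fragment used only to exercise the counter.
page = (
    '<html><body><a href="/thread/1">Thread</a>'
    '<p>I took Drug A.</p><img src="banner.png"/>'
    '<a href="/shop">Shop</a><a name="top"></a></body></html>'
)
links, images = count_links_and_images(page)
```

The resulting counts may then be supplied to the scoring process as additional numeric factors alongside the term indicators.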

In accordance with a preferred embodiment of the present invention, engine 440 may expand the scoring process to also include statistical data from cumulative scoring. Such factors may include, for example, the ratio of posts to the number of discussions (aka “threads”); or the overall ranking of a given anchor and/or term in “good” and “bad” websites. For example, the anchor term “Aspirin” may have an overall high ranking in “good” posts; statistically, personal experience event reports citing Aspirin may typically be genuine. However, the anchor term “Viagra” may typically be indicative of SPAM or commercial posts.

It will be appreciated that utility 400 may be configured to facilitate inspection of the results of step 530 by a user of system 100, and to enable the user to adjust the input data as necessary to achieve a truer result. Accordingly, step 530 may be repeated as necessary. The patterns scored by engine 440 may be stored in weighted indicators database 435. It will be appreciated that weighted indicators database 435 may therefore contain a superset (including calculated weights) of the terms in detected patterns database 415 and known relevant terms 425. It will also be appreciated that database 435 may provide the basis for terms database 235.

Pre-collection post collector 450 may collect (step 540) posts from candidate websites 420 on the Internet by formulating search queries based on positive term based indicators from weighted indicators database 435. Candidate scoring engine 460 may then score (step 550) each website 420 vis-à-vis all of the factors in weighted indicators database 435 to assess its likelihood to contain user generated personal experience event reports. System 100 may be configured with a threshold weighted score to determine whether or not a given website 420 may be considered likely to contain user generated personal experience event reports.
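
By way of a non-limiting illustration, the scoring of a candidate website against a threshold may be sketched in Python as follows. The indicator weights, the threshold value, the sample posts, and the choice to average the per-post sums are assumptions made for this example; the description above does not fix any particular scoring formula.

```python
# Hypothetical weights standing in for weighted indicators database 435.
WEIGHTED_INDICATORS = {"i took": 0.4, "felt better": 0.3, "buy": -0.5, "sale": -0.4}

# Hypothetical threshold weighted score.
THRESHOLD = 0.25

def score_website(posts, indicators=WEIGHTED_INDICATORS):
    """Average, over the sampled posts, of the summed weights of indicators present."""
    if not posts:
        return 0.0
    total = 0.0
    for post in posts:
        text = post.lower()
        total += sum(weight for term, weight in indicators.items() if term in text)
    return total / len(posts)

def is_likely_source(posts, threshold=THRESHOLD):
    """True when the website's weighted score exceeds the threshold."""
    return score_website(posts) > threshold

forum_posts = ["I took Drug A and felt better", "I took it twice a day"]
shop_posts = ["Buy Drug A now", "Big sale today"]
forum_score = score_website(forum_posts)
shop_score = score_website(shop_posts)
```

Averaging over the sampled posts keeps the score comparable between websites from which different numbers of posts were collected; other normalizations may equally be used.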

Utility 400 may update (step 560) website collection list 465 to include websites 420 that exceed such a threshold. It will be appreciated that process 500 may be performed on a periodic basis to continually update list 465. Accordingly, utility 400 may also record websites 420 with weighted scores below the threshold to avoid examining them again in the future.
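
By way of a non-limiting illustration, the periodic update may be sketched in Python as follows. Websites whose scores exceed the threshold are added to the collection list, the remainder are recorded as rejected, and previously rejected websites are skipped on later runs. The function name, the use of sets, the example host names, and the treatment of a score exactly equal to the threshold as a rejection are assumptions made for this example.

```python
def update_collection_list(scores, threshold, collection_list, rejected):
    """Update both sets in place from a mapping of website -> weighted score.

    Returns the list of websites newly added to the collection list.
    """
    added = []
    for site, score in scores.items():
        if site in rejected or site in collection_list:
            continue  # already decided in an earlier run; do not re-examine
        if score > threshold:
            collection_list.add(site)
            added.append(site)
        else:
            rejected.add(site)
    return added

# Hypothetical state standing in for list 465 and the record of rejected websites.
collection_list = set()
rejected = set()

first_run = update_collection_list(
    {"forum.example": 0.55, "shop.example": -0.45}, 0.25, collection_list, rejected
)
# A later run: the rejected site is skipped even if it now scores higher.
second_run = update_collection_list(
    {"shop.example": 0.90, "health.example": 0.30}, 0.25, collection_list, rejected
)
```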

It will be appreciated that website collection list 465 may be used by post collector 50 in the embodiment of FIG. 1.

Unless specifically stated otherwise, as apparent from the preceding discussions, it is appreciated that, throughout the specification, discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining,” or the like, refer to the action and/or processes of a computer, computing system, or similar electronic computing device that manipulates and/or transforms data represented as physical, such as electronic, quantities within the computing system's registers and/or memories into other data similarly represented as physical quantities within the computing system's memories, registers or other such information storage, transmission or display devices.

Embodiments of the present invention may include apparatus for performing the operations herein. This apparatus may be specially constructed for the desired purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk, including floppy disks, optical disks, magnetic-optical disks, read-only memories (ROMs), compact disc read-only memories (CD-ROMs), random access memories (RAMs), electrically programmable read-only memories (EPROMs), electrically erasable and programmable read only memories (EEPROMs), magnetic or optical cards, Flash memory, or any other type of media suitable for storing electronic instructions and capable of being coupled to a computer system bus.

The processes and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct a more specialized apparatus to perform the desired method. The desired structure for a variety of these systems will appear from the description below. In addition, embodiments of the present invention are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein.

While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes, and equivalents will now occur to those of ordinary skill in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.

What is claimed is:
 1. A method for detecting personal experience event reports from user generated content on the Internet, implementable on a computing device, the method comprising: filtering a collection of Internet posts to include only said Internet posts containing personal experience terms; and further filtering said filtered Internet posts by removing said Internet posts with non-personal experience terms.
 2. A method according to claim 1 and also comprising: compiling a list of post collection websites; and collecting said Internet posts according to said list of websites for analyzing on a periodic basis.
 3. A method according to claim 2 and wherein said compiling comprises at least one of: detecting “good” textual patterns indicative of an authentic user generated personal experience event report from a training set of authenticated user generated personal experience event reports; or detecting “bad” textual patterns indicative of a non-authentic user generated personal experience event report from a training set of non-valid user generated personal experience event reports.
 4. A method according to claim 3 and also comprising: assigning weights to each of said “good” and “bad” textual patterns to reflect a likelihood of said user generated personal experience event reports including each of said “good” and “bad” textual patterns.
 5. A method according to claim 4 and also comprising: assigning weights to predictive factors associated with said authentic and non-authentic user generated personal experience event reports in said training sets to reflect a likelihood of said user generated personal experience event reports being associated with at least some of said predictive factors, wherein said predictive factors include at least one of external website/page rankings and factors derived from said training sets.
 6. A method according to claim 5 and wherein said derived factors include at least one of website metadata, number of images per page, number of links per page, ratio of authentic user generated personal experience event reports per discussion thread, number of authentic user generated product personal experience event reports per website, total anchor terms detected, and total terms detected.
 7. A method according to claim 6 and also comprising identifying said candidate websites with Internet posts including terms from at least one of two “anchor” categories, wherein said anchor categories represent two essential components of user generated product personal experience reports; collecting at least a sample of said Internet posts from said identified candidate websites; scoring each candidate website according to a cumulative weighted score as per said set of weighted indicators, wherein a pre-defined score threshold indicates a website with user generated personal experience event reports; and adding said website with said user generated personal experience event reports to said list of post collection websites.
 8. A method for compiling a list of Internet post collection websites, implementable on a computing device, the method comprising: detecting “good” textual patterns indicative of an authentic user generated personal experience event report from a training set of authenticated user generated personal experience event reports; and detecting “bad” textual patterns indicative of a non-authentic user generated personal experience event report from a training set of non-valid user generated personal experience event reports.